libraries
#graph plotting packages.
library(data.table)
library(ggplot2)
library(plotly)
library(gganimate)
library(d3Tree)
library(dplyr)
library(mltools)
library(corrplot)
library(kableExtra)
Reading the data into R and dropping Over18,EmployeeCount,EmployeeNumber,StandardHours because in these columns the data is either unique or the data is same for all the row in the table.
hr<-fread("WA_Fn-UseC_-HR-Employee-Attrition.csv",na.strings = "NA",stringsAsFactors = TRUE,drop = c("Over18","EmployeeCount","EmployeeNumber","StandardHours"))
Now seing the structure of the data to get some idea of type of data present in the data set.
str(hr)
Classes ‘data.table’ and 'data.frame': 1470 obs. of 31 variables:
$ Age : int 41 49 37 33 27 32 59 30 38 36 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
$ Education : int 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
$ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
$ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
$ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
$ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
- attr(*, ".internal.selfref")=<externalptr>
Now viewing the summary of all numeric column of the data set.
summary(hr)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField
Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0 Human Resources : 63 Min. : 1.000 Min. :1.000 Human Resources : 27
1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0 Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000 Life Sciences :606
Median :36.00 Travel_Rarely :1043 Median : 802.0 Sales :446 Median : 7.000 Median :3.000 Marketing :159
Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913 Medical :464
3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000 Other : 82
Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000 Technical Degree:132
EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus
Min. :1.000 Female:588 Min. : 30.00 Min. :1.00 Min. :1.000 Sales Executive :326 Min. :1.000 Divorced:327
1st Qu.:2.000 Male :882 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000 Research Scientist :292 1st Qu.:2.000 Married :673
Median :3.000 Median : 66.00 Median :3.00 Median :2.000 Laboratory Technician :259 Median :3.000 Single :470
Mean :2.722 Mean : 65.89 Mean :2.73 Mean :2.064 Manufacturing Director :145 Mean :2.729
3rd Qu.:4.000 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000 Healthcare Representative:131 3rd Qu.:4.000
Max. :4.000 Max. :100.00 Max. :4.00 Max. :5.000 Manager :102 Max. :4.000
(Other) :215
MonthlyIncome MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears
Min. : 1009 Min. : 2094 Min. :0.000 No :1054 Min. :11.00 Min. :3.000 Min. :1.000 Min. :0.0000 Min. : 0.00
1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000 Yes: 416 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
Median : 4919 Median :14236 Median :2.000 Median :14.00 Median :3.000 Median :3.000 Median :1.0000 Median :10.00
Mean : 6503 Mean :14313 Mean :2.693 Mean :15.21 Mean :3.154 Mean :2.712 Mean :0.7939 Mean :11.28
3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
Max. :19999 Max. :26999 Max. :9.000 Max. :25.00 Max. :4.000 Max. :4.000 Max. :3.0000 Max. :40.00
TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
Median :3.000 Median :3.000 Median : 5.000 Median : 3.000 Median : 1.000 Median : 3.000
Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229 Mean : 2.188 Mean : 4.123
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000 Max. :15.000 Max. :17.000
colnames(hr)
[1] "Age" "Attrition" "BusinessTravel" "DailyRate" "Department"
[6] "DistanceFromHome" "Education" "EducationField" "EnvironmentSatisfaction" "Gender"
[11] "HourlyRate" "JobInvolvement" "JobLevel" "JobRole" "JobSatisfaction"
[16] "MaritalStatus" "MonthlyIncome" "MonthlyRate" "NumCompaniesWorked" "OverTime"
[21] "PercentSalaryHike" "PerformanceRating" "RelationshipSatisfaction" "StockOptionLevel" "TotalWorkingYears"
[26] "TrainingTimesLastYear" "WorkLifeBalance" "YearsAtCompany" "YearsInCurrentRole" "YearsSinceLastPromotion"
[31] "YearsWithCurrManager"
Now we are grouping the ages bases based on age groups. 20-belowe 20-30 30-40 40-50 50-60 so now we are dividing our analysis on different age group.
hr[,Age:=as.factor( ifelse(
Age <= 20 , '15-20',
ifelse(Age >20 & Age <= 30,'20-30' ,
ifelse(Age > 30 & Age <= 40, '30-40',
ifelse(Age >40 & Age <=50,'40-50','50-60') )
)
)
)
]
colnames(hr)
[1] "Age" "Attrition" "BusinessTravel" "DailyRate" "Department"
[6] "DistanceFromHome" "Education" "EducationField" "EnvironmentSatisfaction" "Gender"
[11] "HourlyRate" "JobInvolvement" "JobLevel" "JobRole" "JobSatisfaction"
[16] "MaritalStatus" "MonthlyIncome" "MonthlyRate" "NumCompaniesWorked" "OverTime"
[21] "PercentSalaryHike" "PerformanceRating" "RelationshipSatisfaction" "StockOptionLevel" "TotalWorkingYears"
[26] "TrainingTimesLastYear" "WorkLifeBalance" "YearsAtCompany" "YearsInCurrentRole" "YearsSinceLastPromotion"
[31] "YearsWithCurrManager"
In our data we have some features “Education”, “EnvironmentSatisfaction”,“JobInvolvement”,“JobLevel”,“JobSatisfaction”,“PerformanceRating”,“RelationshipSatisfaction”,“StockOptionLevel”," WorkLifeBalance" which are numeric but the sense they make is factors,so now converting those variable to factors.
fact_columns<-c("Education", "EnvironmentSatisfaction","JobInvolvement","JobLevel","JobSatisfaction","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","WorkLifeBalance")
hr<-as.data.frame(hr)
hr[fact_columns]<-lapply(hr[fact_columns], as.factor)
hr<-as.data.table(hr)
Analysis on age group.
Now we are comparing which age groups are travelling the most.
ggplot(hr[,.N,by=list(Age,BusinessTravel,Gender)],aes(x=Age,y=N,fill=Gender))+geom_bar(stat = "identity",width = 0.6,position = "dodge")+facet_grid(~BusinessTravel)+geom_text(aes(label=N), vjust=-0.3)+labs(x ="The Age Groups", y = "Number of Employee")
salary_monthly_dep<-function(age,typeTravel){
x<-ggplot(hr[Age==age& BusinessTravel==typeTravel,.N,by=list(JobRole,JobSatisfaction,MonthlyIncome, Department)],aes(x=JobRole,y=MonthlyIncome,fill=JobRole))+geom_bar(stat="identity",width = 0.3) +facet_wrap(~Department,ncol = 1,scales = "free")
ggplotly(x)
}
salary_monthly_dep("30-40","Travel_Rarely")
#for movement
#transition_states(EnvironmentSatisfaction, .01, .001)+ease_aes('cubic-in-out')
#for plotting multiplot on one page
#multiplot(p1, p2, p3, p4, cols=2)
Now finding the insights regarding employee turn over. so we found few feature which are very important for employee turnover. MonthlyIncome ,TotalWorkingYears,OverTime ,JobLevel JobRole ,Age, EnvironmentSatisfaction ,YearsAtCompany
what percentage of employees leaving the company.
# Create test data.
dat = data.frame(count=c(10, 60, 30), category=c("A", "B", "C"))
# Add addition columns, needed for drawing with geom_rect.
dat$fraction = dat$count / sum(dat$count)
dat = dat[order(dat$fraction), ]
dat$ymax = cumsum(dat$fraction)
dat$ymin = c(0, head(dat$ymax, n=-1))
# Make the plot
p1 = ggplot(dat, aes(fill=category, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
annotate("text", x = 0, y = 0, label = "My Ring plot !") +
labs(title="")
Now ploting the scatter plot for to see the relationship between the monthly income ,Total working years and the attrition rate.
x<-ggplot(hr,aes(TotalWorkingYears,MonthlyIncome,color=JobLevel,shape=OverTime))+geom_point()+facet_wrap(~Attrition)+labs(x = "Total Working Years",y="Monthly_Income")
ggplotly(x)
As we see in the above graph that the employees who ’s monthly income is less than 10000$ ,total working years in range 0-20 ,job level 1-3 are having the maximum attrition.
so now i am dividing my analysis into 2 parts as in monthly_income less than 5000 and 5000-10000 different as different job level people are :- 1)analysis on employee who ’s monthly income is less than 5000$ 2)analysis on employee who ’s monthly income is in between 5000$ - 10000$
Now doing analysis on the employees who ’s salary is less than 5000$ Now we further dividing our analysis in to two parts total working years 1) Less than 5 2) Between 5 to 10 yeras
Now I am going to view job level of employee and wheather the employee is working over time ,of those employee ’s who ’s monthly income is less than 5000$ and total woking years is in the range 0-5 years.
x<-ggplot(hr[MonthlyIncome<=5000 & TotalWorkingYears <= 5 & Attrition=="Yes",.N,by=list(JobLevel,OverTime,MonthlyIncome,TotalWorkingYears)],aes(TotalWorkingYears,MonthlyIncome,color=JobLevel))+geom_point()+facet_wrap(~OverTime)+labs(x = "Total Working Years",y="Monthly_Income")+labs(title = "Over Time",fill="Age Group")
ggplotly(x)
so by seing the above graph we can infer that job level 1 and 2 employee ’s who ’s salary is less than 5000$ and the total working years is less than 5 have high probability of leaving the company,their is no much difference of employee turnover in these employees
But if we see in more detail 1 year experience and job level 1 employee are leaving the most.
Now we are checking their Environment satisfaction rating ,job,roles,Age group.
x<-ggplot(hr[MonthlyIncome <= 5000 & TotalWorkingYears <= 1 & JobLevel == 1 &Attrition=="Yes",.N,by=list(JobRole,EnvironmentSatisfaction,Age)],aes(x=JobRole,y=N,fill=Age))+geom_bar(stat="identity",width = 0.3) +facet_wrap(~EnvironmentSatisfaction ,ncol = 1,scales = "free")+coord_flip()+labs(title = "EnvironmentSatisfaction Rating (1-4)",y="The Number of Employees moving out of the company. ",fill="Age Group")
ggplotly(x)
By the above graph we can infer by job role that
There are total 12 employees where
At the age group of 30-40 the employee was not satisfied.At the age group of 15-20 the employee are very much satisfied with the Environment but still leaving the company.
so now we are going to find out why at age group of 15-20 ,who ’s environment satisfaction is high and very high are still leaving the company.
The maximum monthly income of one year experience research Scientist by the historical data of the company is : 2994 The minimum monthly income of one year experience research scientist by the historical data of the company is : 1009
The maximum monthly income of research Scientist by the historical data of the company is :9724 The minimum monthly income of research scientist by the historical data of the company is :1009
x<-ggplot(hr[MonthlyIncome<=5000 & TotalWorkingYears == 1 &Attrition=="Yes"& (EnvironmentSatisfaction==3 | EnvironmentSatisfaction==4)&JobRole=="Research Scientist",.N,by=list(OverTime,MonthlyIncome,YearsAtCompany,WorkLifeBalance)],aes(YearsAtCompany,MonthlyIncome,fill= WorkLifeBalance ))+geom_point()+facet_wrap(~OverTime)+labs(x = "",y="Monthly_Income")+labs(title = "Over Time",x="years at the company",fill="Work Life Balance")
ggplotly(x)
The most of the employee who came to this post in this company are freshers and they got 1 year of experience in the current company. The research scientist who are doing over time and getting the similar monthly income as the employee who are not doing over time are leaving the company. their work life balance is good.
x<-ggplot(hr[MonthlyIncome<=5000 & TotalWorkingYears == 1 &Attrition=="Yes"& (EnvironmentSatisfaction==3 | EnvironmentSatisfaction==4)&JobRole=="Research Scientist"&OverTime=="Yes",.N,by=list(BusinessTravel,StockOptionLevel ,DistanceFromHome ,YearsWithCurrManager, JobInvolvement ,PercentSalaryHike,RelationshipSatisfaction, EducationField)],aes(PercentSalaryHike,RelationshipSatisfaction ,fill= JobInvolvement ))+geom_point()+facet_wrap(~EducationField)+labs(x = "PercentSalaryHike",y="RelationshipSatisfaction ")+labs(title = "EducationField",fill="JobInvolvement")
ggplotly(x)
The employees who are frequently travelling the bussiness trips are generally moving out of the company and their job involment is also medium.Their relationshipsatisfaction rating also is very low
The minimum percentage hikes given to research scientist is : 11% The maximum percentage hikes given to research scientist is : 25%
so the percentage hikes are also given good to these employees.
x<-ggplot(hr[MonthlyIncome <= 5000 & TotalWorkingYears <= 5 & (JobLevel == 2 | JobLevel == 1) &Attrition=="Yes",.N,by=list(JobRole,Department,Attrition,EnvironmentSatisfaction,Age)],aes(x=JobRole,y=N,fill=Age))+geom_bar(stat="identity",width = 0.3) +facet_wrap(~EnvironmentSatisfaction ,ncol = 1,scales = "free")+coord_flip()+labs(title = "EnvironmentSatisfaction Rating (1-4)",y="The Number of Employees moving out of the company. ",fill="Age Group")
ggplotly(x)
As we can see in the above graph we can infer that For the employee ’s who ’s monthly income is below 5000$ ,total working years is less than 5 ,we have only two job levels
For job level 1 these are the job roles and department : 1:Laboratory Technician (Research & Development department) 2:Research Scientist (Research & Development department) 3:Sales Representative (Sales department) 4:Human Resources (Human Resources department)
For job level 2 these are the job roles and department : 1:Manufacturing Director (Research & Development department) 2:Sales Executive (Sales department) 3:Healthcare Representative (Research & Development department) 4:Sales Representative (Sales department) 5:Laboratory Technician (Research & Development department)
By seing the EnvironmentSatisfaction rating of the employee who are leaving the company we can infer that :
At Laboratory Technician job role we have very high number of employees turnover,their Environment satisfaction rating is low in all the age group ’s of the employee.
We can see from the above graph in job role of Sales Representative(Sales department),Research Scientist (Research & Development department),Laboratory Technician (Research & Development department) these employee are leaving the company in a larger scale as compare to the other job roles..
Age YearsAtCompany EnvironmentSatisfaction MaritalStatus
Now I am going to look age,yearsAtcompany,Environmentsatisfaction,MaritalStatus.of the employees who ’s job role are Sales Representative(Sales department),Research Scientist (Research & Development department),Laboratory Technician (Research & Development department)
nn <- hr[,list(Age,YearsAtCompany,EnvironmentSatisfaction,MaritalStatus)]
d3tree(list(root = df2tree(rootname = 'Overall',struct = as.data.frame(nn)),
layout = 'collapse'))
str(hr)
Classes ‘data.table’ and 'data.frame': 1470 obs. of 31 variables:
$ Age : Factor w/ 5 levels "15-20","20-30",..: 4 4 3 3 2 3 5 2 3 3 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
$ Education : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
$ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
$ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 4 2 3 4 3 1 2 2 2 ...
$ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...
$ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
- attr(*, ".internal.selfref")=<externalptr>
str(hr)
Classes ‘data.table’ and 'data.frame': 1470 obs. of 31 variables:
$ Age : int 41 49 37 33 27 32 59 30 38 36 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
$ Education : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
$ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
$ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 4 2 3 4 3 1 2 2 2 ...
$ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...
$ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
- attr(*, ".internal.selfref")=<externalptr>
convert_to_character <- c("Attrition", "BusinessTravel","Department","Education","EducationField","EnvironmentSatisfaction","Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","WorkLifeBalance")
hr_encoded=hr
hr_encoded[, convert_to_character] <- hr[, lapply(.SD, as.numeric), .SDcols = convert_to_character]
Handling the high correlation betweeen independent variable
hr_encoded <- one_hot(as.data.table(hr),dropUnusedLevels=TRUE)
hr_encoded<-hr_encoded[,-"Attrition_No"]
This graph shows the correlation with decreasing order.
hr_encoded<-as.data.frame(hr_encoded)
# remove features which are not used for correlation analysis
training <- select(hr_encoded, -c(Attrition_Yes))
# calculate correlation coefficient of each feature with survival
feature <- names(training)
corrSurvived <- data.frame(feature = feature, coef = rep(NA, length(feature)))
for (iFeature in 1:length(feature)){
corrSurvived$coef[iFeature] <- cor(training[, iFeature],hr_encoded$Attrition_Yes)
}
# sort by correlation coefficient
corrSurvivedOrder <- corrSurvived[order(corrSurvived$coef, decreasing = FALSE), ]
ggplot(corrSurvivedOrder, aes(x = factor(feature, levels = feature), y = coef,fill=coef)) +
geom_bar(stat = "identity") +
coord_flip() +
xlab("Feature") +
ylab("Correlation Coefficient")
correlation plot
# function to calculate plot feature correlation matrix
getCorrMatrix <- function(featureList, showPlot = TRUE){
# remove Survived from training set and order feature with respect to correlation coefficient to survived
passengerCorr <- hr_encoded[, as.character(featureList)]
# calculate correlation matrix
corrMatrix <- cor(passengerCorr)
# plot matrix
if (showPlot) {corrplot(corrMatrix, method = "color", type = "upper")}
corrMatrix
}
corrMatrix <- getCorrMatrix(rev(corrSurvivedOrder$feature))
# function to get data frame with pairwise correlation of features
getPairCorrelation <- function(corrMatrix){
featureName <- colnames(corrMatrix)
nFeature <- length(featureName)
# set lower triangle of matrix to NA (these values are all redundant)
corrMatrix[lower.tri(corrMatrix, diag = TRUE)] <- NA
# convert matrix to data frame
featurePair <- data.frame(feature1 = rep(featureName, nFeature), feature2 = rep(featureName, each = nFeature), coef = as.vector(corrMatrix))
# remove NAs
featurePair <- featurePair[!is.na(featurePair$coef), ]
# calculate absolute value of correlation coefficient
featurePair$coefAbs <- abs(featurePair$coef)
# order by coefficient
featurePair <- featurePair[order(featurePair$coefAbs, decreasing = TRUE), ]
featurePair
}
featureCorr <- getPairCorrelation(corrMatrix)
Lets have a look at the ten feature pairs with the highes correlations:
kable(featureCorr[1:10, ]) %>% kable_styling(full_width = FALSE)
| feature1 | feature2 | coef | coefAbs | |
|---|---|---|---|---|
| 1084 | Gender_Male | Gender_Female | -1.0000000 | 1.0000000 |
| 2551 | OverTime_Yes | OverTime_No | -1.0000000 | 1.0000000 |
| 2495 | MonthlyIncome | JobLevel | 0.9502999 | 0.9502999 |
| 1791 | Department_Sales | Department_Research & Development | -0.9068183 | 0.9068183 |
| 726 | JobRole_Human Resources | Department_Human Resources | 0.9049828 | 0.9049828 |
| 669 | Department_Sales | JobRole_Sales Executive | 0.8088693 | 0.8088693 |
| 2548 | JobLevel | TotalWorkingYears | 0.7822078 | 0.7822078 |
| 986 | PerformanceRating | PercentSalaryHike | 0.7735500 | 0.7735500 |
| 2546 | MonthlyIncome | TotalWorkingYears | 0.7728932 | 0.7728932 |
| 2287 | YearsAtCompany | YearsWithCurrManager | 0.7692124 | 0.7692124 |
# plot histogram of correlation factors
ggplot(featureCorr, aes(coef)) + geom_histogram(binwidth = 0.1) + xlab("Correlation Coefficient")
4 Feature Reduction
1)Greedy Elimination
featureGE
[1] "OverTime_No" "MaritalStatus_Single" "EducationField_Technical Degree" "JobRole_Manufacturing Director"
[5] "JobSatisfaction" "WorkLifeBalance" "DistanceFromHome" "BusinessTravel_Travel_Frequently"
[9] "MonthlyRate" "PercentSalaryHike" "JobInvolvement" "RelationshipSatisfaction"
[13] "Gender_Female" "EnvironmentSatisfaction" "HourlyRate" "EducationField_Human Resources"
[17] "DailyRate" "EducationField_Other" "TrainingTimesLastYear" "TotalWorkingYears"
[21] "JobRole_Healthcare Representative" "Education" "BusinessTravel_Non-Travel" "JobRole_Sales Representative"
[25] "EducationField_Medical" "JobRole_Laboratory Technician" "NumCompaniesWorked" "JobRole_Research Director"
[29] "Department_Research & Development" "JobRole_Research Scientist" "MaritalStatus_Divorced" "YearsInCurrentRole"
[33] "JobRole_Manager" "EducationField_Marketing" "YearsSinceLastPromotion" "JobRole_Human Resources"
[37] "EducationField_Life Sciences" "MaritalStatus_Married" "StockOptionLevel" "Age"
[41] "YearsWithCurrManager" "JobRole_Sales Executive" "BusinessTravel_Travel_Rarely" "YearsAtCompany"
[45] "PerformanceRating" "JobLevel" "Department_Human Resources" "Department_Sales"
[49] "MonthlyIncome" "OverTime_Yes" "Gender_Male"
PCA for feature reduction
pcaTraining
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
Standard deviation 2.3239 1.87827 1.62404 1.49462 1.4513 1.42555 1.40214 1.34540 1.33444 1.26975 1.2323 1.15566 1.13587 1.11399 1.10214 1.08224
Proportion of Variance 0.1039 0.06784 0.05072 0.04296 0.0405 0.03908 0.03781 0.03481 0.03424 0.03101 0.0292 0.02568 0.02481 0.02386 0.02336 0.02252
Cumulative Proportion 0.1039 0.17170 0.22242 0.26538 0.3059 0.34497 0.38278 0.41759 0.45183 0.48284 0.5120 0.53772 0.56253 0.58640 0.60976 0.63228
PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32
Standard deviation 1.07025 1.06542 1.05691 1.03716 1.01844 1.00651 1.00119 0.98751 0.98057 0.97037 0.96322 0.95597 0.93864 0.93224 0.89044 0.83044
Proportion of Variance 0.02203 0.02183 0.02148 0.02069 0.01995 0.01948 0.01928 0.01875 0.01849 0.01811 0.01784 0.01757 0.01694 0.01671 0.01525 0.01326
Cumulative Proportion 0.65431 0.67614 0.69762 0.71831 0.73825 0.75773 0.77701 0.79576 0.81425 0.83236 0.85020 0.86778 0.88472 0.90144 0.91668 0.92995
PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47
Standard deviation 0.80075 0.72875 0.69784 0.69227 0.59715 0.52690 0.47494 0.46776 0.38485 0.33312 0.28375 0.22710 0.18605 2.873e-15 1.676e-15
Proportion of Variance 0.01233 0.01021 0.00937 0.00922 0.00686 0.00534 0.00434 0.00421 0.00285 0.00213 0.00155 0.00099 0.00067 0.000e+00 0.000e+00
Cumulative Proportion 0.94228 0.95249 0.96185 0.97107 0.97793 0.98327 0.98760 0.99181 0.99466 0.99679 0.99834 0.99933 1.00000 1.000e+00 1.000e+00
PC48 PC49 PC50 PC51 PC52
Standard deviation 1.202e-15 1.091e-15 8.913e-16 6.633e-16 3.165e-16
Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
y<-ggplot(hr[TotalWorkingYears > 0 & TotalWorkingYears < 10,.N,by=list(Attrition,OverTime)],aes(x=OverTime,y=N,fill=Attrition))+geom_bar(stat = "identity",width = 0.2,position = "dodge")+geom_text(aes(label=N), vjust=-0.30)+labs(x ="overtime", y = "Number of Employee")+ylim(0,400)
ggplotly(y)
The conclusion ,the most of the employees who have worked 0-5 years , their salary is less than 5000$,and are doing overtime and the job level 1 have higher probabilty of leaving the company.
JobRole ,Age, EnvironmentSatisfaction ,YearsAtCompany
x<-ggplot(hr,aes(YearsAtCompany,MonthlyIncome,color=EnvironmentSatisfaction,shape=Age))+geom_point()+facet_wrap(~Attrition)+labs(x = "Years At the Company",y="Monthly_Income")
ggplotly(x)
As we can see above that the employee of age group 20-30,number of years in this company is less than 5 and monthly salary is belowe $5000 are leaving the company the most.
Now we are knowe that at job level 1 the most of the employee ’s are leaving. so now i am going to check in job level 1 in which role the employees are leaving the most.
plotting the bar plot for the data representation.
ggplot(hr[JobLevel==1 & MonthlyIncome<5000 &TotalWorkingYears<=5 &YearsAtCompany<5 & Age=="20-30",.N,by=list(JobRole,Attrition,Department)],aes(x=JobRole,y=N,fill=Department))+geom_bar(stat = "identity",width = 0.2)+labs(x ="jobRole", y = "Number of employees")+facet_wrap(~Attrition)+coord_flip()+geom_text(aes(label=N), vjust=-0.2)
from this we can conclude that in job level 1,Age group is 20-30 ,monthly income is less than 5000$ ,totalworking years are less than 5 years,and years at company less than equal to 5 In job level one There are three department 1)Human Resource 2)Research and development 3)sales.
In job level one we have 4 job roles in a Department :- 1)Laboratory Technician 2)Research Scientist 3)Sales Representative 4)Human Resources
where the most of the employee ’s are leaving in job role of Laboratory Technician of Research and development department and Sales Representative Sales department.
MaritalStatus,WorkLifeBalance
x<-ggplot(hr,aes(WorkLifeBalance,MonthlyIncome ,color=MaritalStatus))+geom_point()+facet_wrap(~Attrition)+labs(x = "WorkLifeBalance",y="Monthly_Income")
ggplotly(x)
By this graph we can conclude that the employees who ’s monthy income is less than 10000$ ,Marital status is single and worklife balance are Better are leaving the company.